High-speed and high-ratio referential genome compression
Abstract
Motivation: The rapidly increasing number of genomes generated by high-throughput sequencing platforms and assembly algorithms brings problems in data storage, compression and communication. Traditional compression algorithms cannot deliver the high compression ratios required, because DNA sequences have intrinsically challenging features such as a small alphabet size, frequent repeats and palindromes. Reference-based lossless compression, in which only the differences between two similar genomes are stored, is a promising approach with a high compression ratio.

Results: We present a high-performance referential genome compression algorithm named HiRGC. It is based on a 2-bit encoding scheme and an advanced greedy-matching search on a hash table. We compare the performance of HiRGC with four state-of-the-art compression methods on a benchmark dataset of eight human genomes. HiRGC takes <30 min to compress about 21 gigabytes of each set of the seven target genomes into 96-260 megabytes, achieving compression ratios of 217 to 82 times. This performance is at least 1.9 times better than the best competing algorithm on its best case. Our compression speed is also at least 2.9 times faster. HiRGC is stable and robust across different reference genomes. In contrast, the performance of the competing methods varies widely with the choice of reference genome. Further experiments on 100 human genomes from the 1000 Genomes Project and on genomes of several other species again demonstrate that HiRGC's performance is consistently excellent.

Availability and implementation: The C++ and Java source code of our algorithm is freely available for academic and non-commercial use. It can be downloaded from https://github.com/yuansliu/HiRGC.

Contact: [email protected].

Supplementary information: Supplementary data are available at Bioinformatics online.
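To make the two ingredients named in the abstract more concrete (2-bit encoding and greedy matching on a hash table), here is a minimal Python sketch. It is not the HiRGC implementation: the base-to-code mapping, the k-mer length `k`, the function names (`pack_2bit`, `build_index`, `compress`, `decompress`) and the output format (a list of `(position, length)` matches and literal strings) are all assumptions chosen for illustration, and only the bases A, C, G and T are handled.

```python
from typing import Dict, List, Tuple, Union

# Assumed mapping of bases to 2-bit codes (illustrative, not taken from HiRGC).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

Op = Union[Tuple[int, int], str]  # (ref_position, length) match, or literal text


def pack_2bit(seq: str) -> int:
    """Pack an ACGT string into one integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def build_index(ref: str, k: int) -> Dict[int, List[int]]:
    """Hash table from each packed k-mer of the reference to its positions."""
    index: Dict[int, List[int]] = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(pack_2bit(ref[i:i + k]), []).append(i)
    return index


def compress(ref: str, target: str, k: int = 4) -> List[Op]:
    """Greedily replace target substrings with the longest reference match."""
    index = build_index(ref, k)
    ops: List[Op] = []
    literal: List[str] = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        if i + k <= len(target):
            # Try every reference position sharing this k-mer, keep the longest.
            for pos in index.get(pack_2bit(target[i:i + k]), []):
                length = k
                while (pos + length < len(ref) and i + length < len(target)
                       and ref[pos + length] == target[i + length]):
                    length += 1
                if length > best_len:
                    best_pos, best_len = pos, length
        if best_len >= k:
            if literal:  # flush pending unmatched bases first
                ops.append("".join(literal))
                literal = []
            ops.append((best_pos, best_len))
            i += best_len
        else:
            literal.append(target[i])  # no match: store the base verbatim
            i += 1
    if literal:
        ops.append("".join(literal))
    return ops


def decompress(ref: str, ops: List[Op]) -> str:
    """Rebuild the target from the reference and the stored differences."""
    parts = []
    for op in ops:
        if isinstance(op, tuple):
            pos, length = op
            parts.append(ref[pos:pos + length])
        else:
            parts.append(op)
    return "".join(parts)
```

For example, with the reference `ACGTTGCAAGGCTTAC` and a target that differs by a single base at index 8 (`ACGTTGCATGGCTTAC`), `compress` with `k=4` yields two matches around one literal base, and `decompress` restores the target exactly. A practical tool would additionally need to handle lowercase bases, `N` runs and line breaks, and to entropy-code the result; none of that is shown here.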
Similar resources
Searching the Referentially-compressed Genomes by Incomplete Patterns
Genome banks contain precious biological information, most of which has not yet been discovered. Biologists in turn are keen to precisely explore these banks in order to discover effective patterns (such as motifs and retro-transposons) that have a real impact on the function and evolution of living creatures. Because the modern genome sequencing technologies produce genomes at high throughput, many te...
Sequence Factorization with Multiple References
The success of high-throughput sequencing has led to an increasing number of projects which sequence large populations of a species. Storage and analysis of sequence data is a key challenge in these projects, because of the sheer size of the datasets. Compression is one simple technology to deal with this challenge. Referential factorization and compression schemes, which store only the differ...
On-Demand Indexing for Referential Compression of DNA Sequences
The decreasing cost of genome sequencing is creating a demand for scalable storage and processing tools and techniques to deal with the large amounts of generated data. Referential compression is one of these techniques, in which the similarity between the DNA of organisms of the same or an evolutionarily close species is exploited to reduce the storage demands of genome sequences up to 700 time...
Study of Parameters Affecting Separation Bubble Size in High Speed Flows using k-ω Turbulence Model
Shock waves generated at different parts of a vehicle interact with the boundary layer over the surface in high Mach number flows. The adverse pressure gradient across a strong shock wave causes the flow to separate, and peak loads are generated at the separation and reattachment points. The size of the separation bubble in shock boundary layer interaction flows depends on various parameters. Reynolds-averaged...
Effect of Ignition Timing, Equivalence Ratio, and Compression Ratio on the Performance and Emission Characteristics of a Variable Compression Ratio SI Engine using Ethanol-Unleaded Gasoline Blends
This paper investigates the effect of ethanol-unleaded gasoline blends (E0, E10, E25, E35, and E65) on a computer interfaced, four-stroke single cylinder compression ignition engine. The said engine was converted to spark ignition and carburetion to suit ethanol fuel. A suitable provision was provided on the engine to vary the compression ratio thereby making the engine adaptable to operate at lower compres...
Journal: Bioinformatics
Volume 33, Issue 21
Publication date: 2017